Intuition behind LD score regression

In genetics, the standard additive model is

\[\begin{equation} \tilde{y_i}=\sum_{j=1}^J \beta_jx_{ij} +\epsilon_i \end{equation}\]

where \(\tilde{y_i}\) is the phenotype of individual \(i\), \(x_{ij}\) is the genotype of individual \(i\) at SNP \(j\), \(\beta_j\) is the effect size of SNP \(j\) on the phenotype and \(\epsilon_i\) is the residual (non-genetic) term.

The data are typically standardised so that \(var(\tilde{y_i})=1\) and \(var(x_{ij})=1\) for every SNP, which implicitly assumes a relationship between \(\beta_j\) and minor allele frequency (MAF). There are two extremes: (i) with standardisation and a constant variance for \(\beta_j\), every SNP is expected to explain the same amount of variance, so rarer variants (smaller MAF) must have larger per-allele effect sizes to compensate; and (ii) with no standardisation, the distribution of per-allele effect sizes is the same for all SNPs and does not depend on allele frequency. Realistically, the truth lies somewhere between these extremes and will be trait specific.
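To make the link between standardisation and MAF concrete, here is a minimal Python sketch. It assumes NumPy, a genotype variance of \(2p(1-p)\) (Hardy-Weinberg equilibrium) and made-up allele frequencies and effect size; none of these values come from real data. If every SNP has the same effect on the standardised scale, the implied per-allele effect grows as the allele gets rarer.

```python
import numpy as np

# Illustrative minor allele frequencies (assumed values, not from real data).
maf = np.array([0.01, 0.05, 0.20, 0.50])

# Under Hardy-Weinberg equilibrium a 0/1/2 genotype has variance 2p(1-p).
geno_sd = np.sqrt(2 * maf * (1 - maf))

# The same effect for every SNP on the standardised genotype scale.
beta_std = 0.1

# Implied per-allele effect on the raw 0/1/2 genotype scale.
beta_per_allele = beta_std / geno_sd

for p, b in zip(maf, beta_per_allele):
    print(f"MAF={p:.2f}  per-allele effect={b:.3f}")
```

The per-allele effects decrease as MAF increases, which is the "rarer variants compensate with larger effects" assumption built into the standardised model.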

In the genome, SNPs are correlated with one another (linkage disequilibrium, LD), so a GWAS estimates marginal rather than causal effects,

\[\begin{equation} \hat{\beta_j}^{GWAS}=s_j+\sum_{k=1}^J \beta_k r_{x_{i,j},x_{i,k}}+\epsilon_j \end{equation}\]

where \(s_j\) is bias from confounders (e.g. population stratification or relatedness), \(r_{x_{i,j},x_{i,k}}\) is the correlation between SNPs \(j\) and \(k\), and \(\epsilon_j\) is sampling noise.
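As a toy illustration of the marginal-effect formula, the sketch below sets the bias and noise terms to zero and uses a hypothetical three-SNP LD matrix in which only the first SNP is causal. All values are made up for illustration.

```python
import numpy as np

# Hypothetical LD (correlation) matrix for three SNPs:
# SNPs 0 and 1 are in strong LD, SNP 2 is independent of both.
R = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Only SNP 0 has a causal effect.
beta = np.array([0.2, 0.0, 0.0])

# Expected marginal effects with s_j = 0 and no noise: sum_k beta_k * r_jk.
beta_marginal = R @ beta
print(beta_marginal)
```

SNP 1 has no causal effect but still picks up a non-zero marginal effect because it tags SNP 0, while the independent SNP 2 does not.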

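For each SNP, we calculate a \(\chi^2\) association statistic, which measures the strength of the association (for standardised data it is approximately \(N(\hat{\beta_j}^{GWAS})^2\), with \(N\) the sample size). If we define the LD score of SNP \(j\) as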

\[\begin{equation} l_j=\sum_{k=1}^J r^2_{x_{i,j}, x_{i,k}} \end{equation}\]

then our expected \(\chi^2\) can be shown to equal

\[\begin{equation} E(\chi^2_j)=1+N\alpha+\dfrac{Nh^2_{SNP}}{M} l_j \end{equation}\]

where \(N\) is the sample size, \(\alpha\) is a measure of confounding and \(M\) is the number of SNPs (\(J\) in the sums above). This relationship between \(\chi^2\) and LD score is intuitive: the more variants a SNP tags (and the more strongly it tags them), the more likely it is to tag a causal variant (CV). Put another way, “assuming a uniform prior, we see SNPs with more LD friends showing more association”. Note that \(h^2_{SNP}=\sum_j\beta_j^2\).
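A rough sketch of where this expectation comes from, under simplifying assumptions that are not spelled out above: no confounding (\(\alpha=0\)), independent \(\beta_k\) with mean 0 and variance \(h^2_{SNP}/M\), \(var(\epsilon_i)=1-h^2_{SNP}\), and \(\chi^2_j\approx N(\hat{\beta_j}^{GWAS})^2\) for standardised data. Writing \(\hat{r}_{jk}\) for the sample correlation between SNPs \(j\) and \(k\) (notation introduced here),

\[\begin{equation} \hat{\beta_j}^{GWAS}=\dfrac{1}{N}\sum_{i=1}^N x_{ij}\tilde{y_i}=\sum_{k=1}^M \hat{r}_{jk}\beta_k+\dfrac{1}{N}\sum_{i=1}^N x_{ij}\epsilon_i \end{equation}\]

so that

\[\begin{equation} E\left[(\hat{\beta_j}^{GWAS})^2\right]\approx\dfrac{h^2_{SNP}}{M}\sum_{k=1}^M E(\hat{r}^2_{jk})+\dfrac{1-h^2_{SNP}}{N} \end{equation}\]

Using the approximation \(E(\hat{r}^2_{jk})\approx r^2_{jk}+(1-r^2_{jk})/N\), the sum is roughly \(l_j+M/N\) when \(l_j\) is much smaller than \(M\), which gives

\[\begin{equation} E(\chi^2_j)\approx N\left[\dfrac{h^2_{SNP}}{M}\left(l_j+\dfrac{M}{N}\right)+\dfrac{1-h^2_{SNP}}{N}\right]=1+\dfrac{Nh^2_{SNP}}{M}l_j \end{equation}\]

The confounding term \(N\alpha\) then adds to the intercept without changing the slope in \(l_j\).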

So, if we regress the \(\chi^2\) values from the GWAS on \(Nl_j\) across SNPs, the slope estimates \(h^2_{SNP}/M\) and the intercept estimates \(1+N\alpha\).
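The regression can be tried out on simulated summary statistics. The sketch below is a toy example under several assumptions of my own: LD is block-diagonal with a single within-block correlation `rho`, LD scores are known exactly rather than estimated from a reference panel, there is no confounding, and the fit is ordinary unweighted least squares (the LDSC software uses a weighted regression). All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 400_000        # GWAS sample size (assumed)
h2_true = 0.4      # SNP heritability used to simulate (assumed)
rho = 0.7          # correlation between any two SNPs in the same block (assumed)
block_sizes = np.repeat([1, 5, 10, 20, 40], 2_000)  # 10,000 independent LD blocks
M = int(block_sizes.sum())                          # total number of SNPs

chi2_parts, ld_parts = [], []
for b in block_sizes:
    # Causal effects with Var(beta_j) = h2 / M.
    beta = rng.normal(0.0, np.sqrt(h2_true / M), size=b)
    # Marginal effects in an equicorrelated block:
    # (R beta)_j = (1 - rho) * beta_j + rho * sum(beta).
    signal = (1 - rho) * beta + rho * beta.sum()
    # Sampling noise with unit variance and the same correlation as the SNPs.
    noise = np.sqrt(rho) * rng.normal() + np.sqrt(1 - rho) * rng.normal(size=b)
    z = np.sqrt(N) * signal + noise
    chi2_parts.append(z ** 2)
    # LD score: 1 (the SNP itself) plus rho^2 for each other SNP in the block.
    ld_parts.append(np.full(b, 1 + (b - 1) * rho ** 2))

chi2 = np.concatenate(chi2_parts)
ld = np.concatenate(ld_parts)

# Regress chi-square on N * l_j: slope estimates h2 / M, intercept estimates 1 + N * alpha.
slope, intercept = np.polyfit(N * ld, chi2, 1)
h2_est = slope * M

print(f"estimated h2 = {h2_est:.3f} (simulated with {h2_true})")
print(f"intercept    = {intercept:.3f} (expected to be near 1 without confounding)")
```

Because SNPs within a block are highly correlated and high-LD SNPs have noisier \(\chi^2\) values, the estimates fluctuate between runs; that noise is what the weighting in the real method is designed to reduce.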

This method was first used to distinguish population stratification (which produces no relationship between LD score and the \(\chi^2\) association statistic) from genuine polygenic effects (which produce a positive relationship) by examining the LD score regression intercept. The intercept was compared with \(\lambda_{GC}\) (the factor by which the observed \(\chi^2\) values are divided in the genomic-control method) to show that genomic control is unnecessarily conservative (LD score intercept \(<\lambda_{GC}\)).


Key points

  • LD score regression was developed as a tool to distinguish confounding from polygenicity in GWAS using only summary statistics.

  • Its development was based on the fact that \(\chi^2\) values for true associations are positively correlated with LD scores, whereas \(\chi^2\) values for false positives (e.g. due to population stratification/drift) are not correlated with LD scores.

  • The intercept of the \(\chi^2 \sim \text{LD score}\) regression estimates confounding (\(=1\) if there is no confounding), similarly to (but arguably better than) \(\lambda_{GC}\).

  • An extension of LD score regression is stratified LD score regression, which aims to partition heritability by functional annotation.


Stratified LD score regression

We have previously assumed that

\[\begin{equation} Var(\beta_j)=\dfrac{h^2_{SNP}}{M} \end{equation}\]

i.e. that heritability from each SNP is on average the same genome wide. But what if we want to evaluate whether there are regions of the genome with stronger effects (i.e. higher \(Var(\beta_j)\))?

To do this, we allow the variance of \(\beta_j\) to depend on the functional categories \(C_c\) that SNP \(j\) belongs to,

\[\begin{equation} Var(\beta_j)=\sum_{c:j\in C_c}\tau_c \end{equation}\]

If the categories are disjoint, then

\[\begin{equation} h^2_{SNP}(C_c)=\sum_{j\in C_c}\beta_j^2=\tau_c\times M(C_c) \end{equation}\]

where \(M(C_c)\) is the number of SNPs in category \(c\). If the categories overlap, the model assumes that they contribute additively to the variance of \(\beta_j\).


The stratified LD score model now looks like,

\[\begin{equation} E(\chi^2_j)=1+N\alpha+N\sum_c \tau_c l_{j,c} \end{equation}\]

where \(c\) indexes the functional categories and \(l_{j,c}=\sum_{k\in C_c} r^2_{x_{i,j},x_{i,k}}\) is the LD score of SNP \(j\) with respect to category \(c\). That is, rather than summing over all LD friends, we now sum over only those LD friends that are in category \(c\). We can estimate \(\tau_c\), the per-SNP contribution to heritability of category \(c\), by multiple regression of \(\chi^2_j\) on the \(l_{j,c}\), which are computed from reference data for the chosen annotations.
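The mechanics of the stratified model can be sketched on a toy example. Everything below is hypothetical: a small block-diagonal LD matrix, two overlapping binary annotations (one containing every SNP, one containing every third SNP), made-up values of \(\tau_c\), and no confounding. To keep the check exact, the response is the expected \(\chi^2\) from the model rather than noisy simulated statistics, so the regression recovers \(\tau_c\) up to floating-point error; real data would not behave this cleanly.

```python
import numpy as np

# Toy block-diagonal LD matrix: SNPs in the same block have correlation rho (assumed).
rho = 0.6
block_sizes = [2, 4, 6, 8, 10, 12]
M = sum(block_sizes)
R = np.zeros((M, M))
start = 0
for b in block_sizes:
    R[start:start + b, start:start + b] = rho
    start += b
np.fill_diagonal(R, 1.0)

# Hypothetical binary annotations: column 0 holds every SNP, column 1 every third SNP.
A = np.zeros((M, 2))
A[:, 0] = 1.0
A[::3, 1] = 1.0

# Stratified LD scores: l_{j,c} = sum over SNPs k in category c of r_jk^2.
ld_strat = (R ** 2) @ A

# Expected chi-square under the stratified model with alpha = 0.
N = 10_000
tau_true = np.array([1e-5, 4e-5])
chi2_expected = 1 + N * (ld_strat @ tau_true)

# Multiple regression of chi-square on the stratified LD scores, with an intercept.
design = np.column_stack([np.ones(M), ld_strat])
coef, *_ = np.linalg.lstsq(design, chi2_expected, rcond=None)
intercept, tau_hat = coef[0], coef[1:] / N

print("intercept:", intercept)
print("tau_hat:  ", tau_hat)
```

Regressing on \(l_{j,c}\) and dividing the coefficients by \(N\) is equivalent to regressing on \(Nl_{j,c}\), and keeps the design matrix on a comparable scale to the intercept column.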

There are two ways to evaluate partitioned heritability results:

  1. Enrichment of effects in a single annotation.
    • \(h^2_{SNP}(C_c)=\tau_c\times M(C_c)\)
    • \(Enrichment=\dfrac{h^2_{SNP}(C_c)/M(C_c)}{h^2_{SNP}/M}\) (the per-SNP heritability in category \(c\) divided by the genome-wide average heritability per SNP).
  2. Enrichment conditional on other annotations.
    • I.e. whether \(\tau_c\) differs from 0 in multiple linear regression
    • Important for highly overlapping annotations
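The enrichment calculation in the first approach is simple arithmetic. The sketch below works it through for two disjoint categories; the \(\tau_c\) values and SNP counts are made up for illustration and do not come from any real analysis.

```python
import numpy as np

# Hypothetical inputs for two disjoint categories (illustrative values only).
tau = np.array([2e-7, 2e-6])          # per-SNP heritability in each category
m_c = np.array([900_000, 100_000])    # number of SNPs in each category

h2_c = tau * m_c                      # heritability attributed to each category
h2_total = h2_c.sum()
m_total = m_c.sum()

# Per-SNP heritability in the category divided by the genome-wide per-SNP average.
enrichment = (h2_c / m_c) / (h2_total / m_total)

for c, (h, e) in enumerate(zip(h2_c, enrichment)):
    print(f"category {c}: h2 = {h:.2f}, enrichment = {e:.2f}")
```

Here the second category holds 10% of the SNPs but more than half of the heritability, so its enrichment is well above 1, while the first category is depleted (enrichment below 1).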

Full derivations can be found here.

Details on annotations

  • It is often useful to define buffer regions around annotations. A binary 0/1 indicator of whether a SNP falls in an annotation does not capture whether the SNP lies just outside its boundaries, so additional annotations can be defined that also include SNPs falling in these buffer regions (e.g. for each annotation, a second annotation covering the annotation plus its buffer region).

  • The method can be extended to continuous annotations (rather than a 0/1 indicator of whether the SNP is in the annotation; https://www.nature.com/articles/ng.3954).

  • Used to make statements like “variants for BMI are enriched in regions that suggest active marks in CNS cells”.


Advantages of stratified LD score regression

  • Only requires summary statistics.

  • Does not assume a single causal variant per region.

  • Uses all SNPs, not only those reaching genome-wide significance or falling in genome-wide significant regions.

  • Accounts for LD.

  • Computationally efficient.

Drawbacks of stratified LD score regression

  • Requires large data sets and/or large SNP heritability.

  • Trait analysed must be polygenic.

  • Requires an LD reference panel matched to the population studied.

  • Not applicable to studies using custom genotyping arrays (because the LD scores are computed from 1000 Genomes data and need to generalise to the study SNPs).

  • Based on additive model and does not consider non-additive effects.